About the project

This is the first time I have opened RStudio, so I am feeling a bit confused. But I am sure that this environment will become more familiar during this course. I think that this course will be interesting since it seems to be a course where you learn by doing. It is also about two really contemporary subjects: open data and open science. What I expect from this course are at least the following things:

I found out about this course from Weboodi when I was planning what courses I should take this semester.

Here’s a link to my github:

https://github.com/mikkohyy/IODS-project

Here’s another way of creating a link with R Markdown.


Chapter 2: Regression and model validation

Setting up

library(GGally)
library(ggplot2)

# read the wrangled learning2014 data from the data folder
setwd("data/")
learning_data <- read.csv(file="learning2014.csv")

1

str(learning_data)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
head(learning_data)
##   gender age attitude     deep  stra     surf points
## 1      F  53      3.7 3.583333 3.375 2.583333     25
## 2      M  55      3.1 2.916667 2.750 3.166667     12
## 3      F  49      2.5 3.500000 3.625 2.250000     24
## 4      M  53      3.5 3.500000 3.125 2.250000     10
## 5      M  49      3.7 3.666667 3.625 2.833333     22
## 6      F  38      3.8 4.750000 3.625 2.416667     21

This is a dataset that consists of 7 variables and 166 rows - or data about 166 students. The variables can be classified into three groups: 1) background information (gender, age, attitude towards statistics), 2) the learning style of the student (deep, strategic, surface), and 3) how well the student did on the test (points). The data was collected from students of a statistics course.

2

ggpairs(learning_data, mapping=aes(col=gender, alpha=0.3), lower=list(combo=wrap("facethist", bins=20)), upper=list(continuous=wrap("cor", size=3)))

summary(learning_data)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

The students are quite young (median age 22) and there are more females (110) than males (56). The learning styles of the students have more characteristics of “deep learning” (mean 3.68) than “surface learning” (mean 2.79). The strongest correlation is between attitude and points (0.437). Other interesting, but weaker, correlations are deep & attitude, stra & age, stra & points, and surf & points (negative).

Surf is interesting since it correlates negatively with several variables. This is also quite self-evident since, for example, deep learning and surface learning are opposites. There are also some interesting gender-specific correlations - e.g. for males, age has a negative correlation with points.

I think the general impression is that attitude has a fairly strong correlation with points and that the surface learning style correlates negatively with several variables.
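
As a quick check of the correlations quoted above, the full correlation matrix can also be printed directly (a small sketch; it assumes gender, the only non-numeric variable, is the first column, as the str() output above shows):

round(cor(learning_data[, -1]), 2)  # drop gender (column 1) and correlate the rest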

3

I build my model on the assumption that “a good attitude” and a deep (rather than surface) learning style will lead to good results in the test. Seems reasonable, right? Hence, I will use attitude, deep and surf in my model:

model <- lm(points~attitude + deep + surf, data=learning_data)
summary(model)
## 
## Call:
## lm(formula = points ~ attitude + deep + surf, data = learning_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9168  -3.1487   0.3667   3.8326  11.3519 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  18.3551     4.7124   3.895 0.000143 ***
## attitude      3.4661     0.5766   6.011 1.18e-08 ***
## deep         -0.9485     0.7903  -1.200 0.231815    
## surf         -1.0911     0.8360  -1.305 0.193669    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.313 on 162 degrees of freedom
## Multiple R-squared:  0.2024, Adjusted R-squared:  0.1876 
## F-statistic:  13.7 on 3 and 162 DF,  p-value: 5.217e-08

From the summary of the model we can see that only attitude can be considered a statistically highly significant (p < 0.001, or ***) variable in this case. In other words, if attitude had no real relationship with the target variable, an estimate this large would be very unlikely. Hence, we can conclude that there is a statistical relationship between attitude and points.

Since deep and surf were not statistically significant, we change our model. Our new model is simple and will only have attitude as the explanatory variable:

new_model <- lm(points~attitude, data=learning_data)

4

summary(new_model)
## 
## Call:
## lm(formula = points ~ attitude, data = learning_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

What this model tells us is that a high value on attitude seems to increase the points a student gets from the test. In other words, every additional point on attitude increases the test points by an average of 3.53. This is quite logical, since I think that scoring high on attitude implies that a student is motivated.
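
As an illustration, the fitted model can be used to predict the test points at a few attitude values (a small sketch using the new_model object above):

# predicted points for attitude values 2, 3 and 4 (roughly 18.7, 22.2 and 25.7)
predict(new_model, newdata = data.frame(attitude = c(2, 3, 4)))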

The multiple R-squared basically describes how well “the model fits the set of observations” (this was the definition in the blog). In this case it tells how much of the variation in points is explained by attitude alone. This model’s multiple R-squared is 0.1906, which means that roughly 19% of the variation is explained by attitude.
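
With a single explanatory variable the multiple R-squared is simply the squared correlation between the two variables, which can be verified directly:

cor(learning_data$attitude, learning_data$points)^2  # about 0.19, matching the summary above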

5

There are some assumptions behind linear regression models. Naturally, one of them is linearity. Another is that the errors are normally distributed. It is possible to assess the validity of these assumptions by analyzing the residuals of the model, i.e. by asking whether the following hold: 1) the errors of the model are normally distributed, 2) the size of the errors does not depend on the explanatory variables (the errors have constant variance), and 3) the errors are not correlated.

par(mfrow=c(2,2))
# diagnostic plots: Normal Q-Q (2), Residuals vs Fitted (1), Residuals vs Leverage (5)
plot(model,which=c(2,1,5))

Residuals vs Fitted values

Residuals vs Fitted values makes it possible to explore whether there are problems with the constant variance assumption. Since there does not seem to be any pattern in the scatter plot, this assumption appears to hold.

Normal QQ-plot

The normal QQ-plot makes it possible to explore whether there are problems with the assumption of normally distributed errors. It seems that “the dots fit the line” relatively well, although there are some anomalies at both tails. Still, I would conclude that there are no serious problems with this assumption.

Residuals vs Leverage

Residuals vs Leverage makes it possible to explore whether some observations have a high impact on the model - e.g. a value with very high leverage makes the model less reliable. In my model nothing “pulls” the fit, and no single value has a high leverage.


Chapter 3: Logistic regression

library(ggplot2)
library(dplyr)
library(boot)

# read the wrangled alcohol consumption data from the data folder
setwd("data/")
alc_data <- read.csv("part3_alc_data.csv")

2

names(alc_data)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
dim(alc_data)
## [1] 382  35

The data is about students in “secondary education of two Portuguese schools”. In this data set there are 382 students and 35 variables. The variables describe various attributes such as information about the student’s home, their after-school activities and how well they do in school. The data set was collected “using school reports and questionnaires”. The ages of the students are mostly between 15 and 18 (although a few are slightly older).

The data used in this exercise has been combined from two different data sets (the first one about performance in math and the second one in Portuguese language). The students who appeared in both data sets were selected for this data set. Numeric variables were combined by taking the average over the two data sets - i.e. the grades in this data set are the average of the math and Portuguese grades. Hence, they can be claimed to indicate how well the student did in general.

(As a side note, the data I am using (which I wrangled myself) is different from the data that is linked in the analysis assignment. The difference is in the grades. It seems that in the already wrangled data the grades are the math grades (I understood that they should be the average of math and por). So my results may differ from those that are based on the already wrangled data.)

3

The variables I chose were:

  • G3 (final grade): I go with the very traditional assumption that if you get good grades in school (especially if you are 15-18 years old) you probably will not be one of those who use lots of alcohol. Hence, the higher the grade, the less likely the student has “high_use”.
  • goout (going out with friends): The more a student goes out with friends, the more chances they have to consume alcohol. Hence, the more you go out with your friends, the more likely it is that you have “high_use”. And there is also always peer pressure.
  • studytime (weekly study time): If a student uses a lot of time to study, it indicates that they are taking their school/future seriously and they probably will not be one of those who consume alcohol in high quantities. Another possibility is that they simply do not have time to be in environments where alcohol is usually consumed. Hence, the more time a student spends studying, the less likely it is that they have high alcohol consumption.
  • internet (Internet access at home): I thought that asking whether having Internet access at home predicts anything about alcohol consumption would be interesting. My hypothesis is that if you have internet at home, you do not have time for alcohol since you are playing Fortnite.

4

summary(select(alc_data, G3, goout, studytime, internet))
##        G3            goout         studytime     internet 
##  Min.   : 0.00   Min.   :1.000   Min.   :1.000   no : 58  
##  1st Qu.:10.00   1st Qu.:2.000   1st Qu.:1.000   yes:324  
##  Median :12.00   Median :3.000   Median :2.000            
##  Mean   :11.46   Mean   :3.113   Mean   :2.037            
##  3rd Qu.:14.00   3rd Qu.:4.000   3rd Qu.:2.000            
##  Max.   :18.00   Max.   :5.000   Max.   :4.000

It seems that the students are fairly outgoing (goout mean 3.113) and most of them have internet at home. The grades seem to be skewed towards higher grades, and the students do not seem to be that excited about spending their time studying (studytime mean 2.037).

Box and bar plots

par(mar=c(4,4,0.2,0.1))
g.g3 <- ggplot(data=alc_data,aes(x=high_use,y=G3,fill=high_use))
g.g3 + geom_boxplot() + ylab("Final grade")
g.going_out <- ggplot(data=alc_data,aes(x=goout,fill=high_use))
g.going_out + geom_bar() + ylab("Going out")
g.studytime <- ggplot(data=alc_data,aes(x=studytime,fill=high_use))
g.studytime + geom_bar()
g.internet <- ggplot(data=alc_data,aes(x=internet,fill=high_use))
g.internet + geom_bar()

There are at least some differences in grades in relation to high_use - e.g. the median final grade is a bit higher for those without high_use. From the Going out barplot we can see that when goout > 3, the group with high_use becomes much bigger relative to the group without it. For internet, studytime and goout we get more insight with crosstables:

t.internet <- table(alc_data$internet,alc_data$high_use,dnn=c("internet","high_alc"))
addmargins(round(prop.table(t.internet)*100,1)) 
##         high_alc
## internet FALSE  TRUE   Sum
##      no   11.3   3.9  15.2
##      yes  58.9  25.9  84.8
##      Sum  70.2  29.8 100.0
t.studytime <- table(alc_data$studytime,alc_data$high_use,dnn=c("studytime","high_alc"))
addmargins(round(prop.table(t.studytime)*100,1))
##          high_alc
## studytime FALSE TRUE  Sum
##       1    15.2 11.0 26.2
##       2    35.3 15.7 51.0
##       3    13.6  2.1 15.7
##       4     6.0  1.0  7.0
##       Sum  70.1 29.8 99.9
t.goout <- table(alc_data$goout,alc_data$high_use,dnn=c("goout","high_alc"))
addmargins(round(prop.table(t.goout)*100,1))
##      high_alc
## goout FALSE  TRUE   Sum
##   1     5.0   0.8   5.8
##   2    22.0   4.2  26.2
##   3    27.0   6.0  33.0
##   4    10.7  10.5  21.2
##   5     5.5   8.4  13.9
##   Sum  70.2  29.9 100.1

When compared to my hypotheses, these explorations lead to the following conclusions:

  • The higher the grade, the less likely the student has a ‘high_use’
    • It seems that those who were classified as low alcohol users had somewhat better final grades (but the spread was larger among those who do not have high use).
  • The more you go out with friends, the more likely it is that you will have a ‘high_use’
    • It seems that those who go out more probably also have “high_use”.
  • If the student uses a lot of time to study they do not have time/interest to consume alcohol. Therefore, they probably do not have ‘high_use’.
    • I would not say that this was disproved. Those who use more time studying probably also have lower alcohol consumption.
  • If you have internet at home, you do not have time for alcohol since you are playing Fortnite
    • My hypothesis seems to be wrong.

5

the_model <- glm(high_use~G3+goout+studytime+internet,data=alc_data, family="binomial")
summary(the_model)
## 
## Call:
## glm(formula = high_use ~ G3 + goout + studytime + internet, family = "binomial", 
##     data = alc_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7427  -0.7849  -0.5479   0.9523   2.5187  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.80189    0.69364  -2.598 0.009385 ** 
## G3          -0.03626    0.03875  -0.936 0.349408    
## goout        0.72858    0.11802   6.173 6.68e-10 ***
## studytime   -0.57985    0.16616  -3.490 0.000483 ***
## internetyes  0.11896    0.35473   0.335 0.737368    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 399.89  on 377  degrees of freedom
## AIC: 409.89
## 
## Number of Fisher Scoring iterations: 4

What I find surprising is that G3 (the final grade) is not significant in relation to high_use, but goout and studytime are. And, as I thought, having internet at home does not have any significance. From the model we can say that going out has a positive connection with high_use being true (i.e. an increase in goout increases the odds of high_use) and studytime has a negative connection. G3 seems to have a really small negative connection and having internet at home has a positive connection (i.e. having internet increases the odds of high_use) - but neither of them is significant.

The difference between the null deviance (465.68) and the residual deviance (399.89) basically describes how good a fit the model is. In the case of this model the difference is 65.79, which indicates that this model is not bad.
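
A rough way to put a p-value on that drop in deviance is a chi-squared test (a small sketch; the 4 degrees of freedom come from the four added predictors, 381 - 377 in the output above):

# a very small p-value here means the four predictors clearly improve on the null model
pchisq(465.68 - 399.89, df = 381 - 377, lower.tail = FALSE)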

Odds ratios and confidence intervals:

oddr <- coef(the_model) %>% exp
confinterv <- confint(the_model) %>% exp
cbind(oddr,confinterv)
##                  oddr     2.5 %    97.5 %
## (Intercept) 0.1649873 0.0409081 0.6261624
## G3          0.9643898 0.8936694 1.0407265
## goout       2.0721458 1.6539622 2.6296307
## studytime   0.5599823 0.3999663 0.7686348
## internetyes 1.1263211 0.5709167 2.3115264

When we look at the oddr column we see how the variables affect the odds of having high_use (in this context > 1 indicates a positive relationship). Goout has a relatively large positive relationship - it implies that when goout increases by 1, the odds of the student having high_use are multiplied by roughly 2. The confidence intervals suggest that we can be fairly sure that goout has a positive connection and that studytime has a negative one. But for example having internet at home is quite useless as a predictor, since its confidence interval is wide and includes 1.

My hypothesis on internet was proved quite wrong if we look at the model (it has a positive connection). Another surprising thing is that G3 has hardly any connection with alcohol use (its odds ratio is only slightly below 1). Therefore, my hypothesis based on traditional thinking was false. Goout and studytime played the roles that I thought they would. The next question is how well my model predicts.

6

Graph:

better_model <- glm(high_use~goout+studytime,data=alc_data, family="binomial")
probs <- predict(better_model, type="response")
alc_data <- mutate(alc_data,probability=probs)
alc_data <- mutate(alc_data, prediction=probability>0.5)
plot <- ggplot(alc_data,aes(x=probability,y=high_use,col=prediction))
plot + geom_point()

Crosstable:

pred.table <- table(high_use=alc_data$high_use,prediction=alc_data$prediction)
addmargins(prop.table(pred.table))
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.64659686 0.05497382 0.70157068
##    TRUE  0.19109948 0.10732984 0.29842932
##    Sum   0.83769634 0.16230366 1.00000000

Penalty:

lossf <- function(class, prob) {
  # proportion of cases where the predicted probability is on the wrong side of 0.5
  wrong <- abs(class - prob) > 0.5
  mean(wrong)
}
lossf(class=alc_data$high_use,prob=alc_data$probability)
## [1] 0.2460733

This model’s penalty is approximately 0.25, which means that it predicted 75% of the cases right. It seems that my model has difficulties predicting the cases that have high_use: it predicted 19% of all cases wrong by claiming they did not have high_use when they actually did.
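
The same difficulty can be seen by normalising the cross-table within each true class (a quick sketch reusing the pred.table created above):

# share of predictions within each true high_use class; most true high_use cases are predicted FALSE
round(prop.table(pred.table, margin = 1), 2)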

I think that random guessing would be a straightforward and simple comparison strategy:

guess <- sample(c(TRUE,FALSE),size=382,replace=TRUE)
wrong <- alc_data$high_use != guess  # TRUE when the random guess differs from the actual value
mean(wrong)
## [1] 0.5026178

I ran this a few times and the penalty was always somewhere between 0.45 and 0.55. Hence, it seems that my model predicts better than just guessing randomly.

7

crossv <- cv.glm(alc_data,cost=lossf,glmfit=better_model,K=10)
crossv$delta[1]
## [1] 0.2460733

According to cross-validation, my model seems to be a bit better than the one in Data Camp (an error of ~0.25 is slightly better than an error of ~0.26 - but only slightly).

8

At the beginning there is a model with 10 variables. After each “round” one variable is removed until we have the same model (goout and studytime) that was built in the earlier sections. Below is the code for the first round. For the second round I removed traveltime and continued like this until only two variables were left in the model.

massive_model <- glm(high_use~goout+studytime+Fedu+Fjob+Medu+Mjob+freetime+romantic+sex+traveltime,data=alc_data, family="binomial")
crossv <- cv.glm(alc_data,cost=lossf,glmfit=massive_model,K=10)
# the training error is computed from this model's own predicted probabilities
penalty <- lossf(class=alc_data$high_use,prob=predict(massive_model,type="response"))
results <- matrix(nrow=9,ncol=3)
results[1,] <- c(10,penalty,crossv$delta[1])

# Matrix of each round:
results
##       [,1]      [,2]      [,3]
##  [1,]   10 0.2094241 0.2434555
##  [2,]    9 0.2146597 0.2356021
##  [3,]    8 0.2251309 0.2565445
##  [4,]    7 0.2277487 0.2617801
##  [5,]    6 0.2146597 0.2486911
##  [6,]    5 0.2277487 0.2408377
##  [7,]    4 0.2146597 0.2356021
##  [8,]    3 0.2460733 0.2356021
##  [9,]    2 0.2460733 0.2460733
results.df <- as.data.frame(results)
ggplot(data=results.df, aes(x=V1)) +
  geom_line(aes(y=V2, colour="V2")) + geom_point(aes(y=V2, colour="V2")) +
  geom_line(aes(y=V3, colour="V3")) + geom_point(aes(y=V3, colour="V3")) +
  scale_x_continuous(trans="reverse", breaks=unique(results.df$V1)) +
  ylab("Error") + xlab("Number of predictors")

The red line (V2) shows the training error produced by the loss function. The blue line (V3) shows the 10-fold cross-validation error produced by cv.glm.
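
The rounds above were produced by editing and re-running the chunk by hand. A sketch of how the same procedure could be scripted instead (the order in which predictors are dropped after traveltime is an assumption here, so the numbers would not exactly match the matrix above):

predictors <- c("goout","studytime","Fedu","Fjob","Medu","Mjob","freetime","romantic","sex","traveltime")
results <- matrix(nrow=9,ncol=3)
for (i in 1:9) {
  f <- as.formula(paste("high_use ~", paste(predictors, collapse=" + ")))
  m <- glm(f, data=alc_data, family="binomial")
  train_err <- lossf(class=alc_data$high_use, prob=predict(m, type="response"))
  cv_err <- cv.glm(alc_data, cost=lossf, glmfit=m, K=10)$delta[1]
  results[i,] <- c(length(predictors), train_err, cv_err)
  predictors <- predictors[-length(predictors)]  # drop the last predictor before the next round
}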


Chapter 4: Clustering and classification

library(MASS)
library(GGally)  
library(ggplot2)
library(tidyverse)
library(corrplot)
data(Boston)

2

dim(Boston)
## [1] 506  14
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

The Boston dataset is from the MASS package (which is a supplementary package for a book about applied statistics). It has 506 observations (rows) and 14 variables. Its content is described as “Housing Values in Suburbs of Boston”. The variables are quite heterogeneous: some examples are the crime rate by town, nitrogen oxides concentration and the pupil-teacher ratio by town. All variables are numerical (although chas is a dummy variable).

3

First, let’s look at the data:

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

These summaries tell us that the scales of the variables are quite different from each other. There are variables with a maximum value of 8.780 (rm) and variables with a maximum value of 711.0 (tax). Some of the variables also seem to have a rather odd structure - e.g. the crime rate’s (crim) median is 0.26 but the maximum is ~88.98.

Then a plot that is made with ggpairs (I adjusted the theme a bit to make it more readable):

ggpairs(Boston, upper=list(continuous=wrap("cor", size=2.2)), lower=list(continuous=wrap("points", size=0.5))) +
  theme(axis.text=element_text(size=5), panel.grid.major=element_blank(), axis.ticks=element_blank(),
        panel.border=element_rect(linetype="solid", colour="black", fill=NA), strip.text=element_text(size=7))

Only a few of the variables look even somewhat normally distributed (e.g. rm, the average number of rooms per dwelling). Most of the variables are skewed and/or bimodal. From the scatter plots we can find at least one clear (negative) correlation, between the median value of homes (medv) and the lower status percentage (lstat). There are also some variables whose values are mostly at the “lower” end of the scale (e.g. crim & chas). The correlations are probably much easier to see from a correlation matrix:

cor_mtx <- round(cor(Boston),2)
corrplot(round(cor(Boston),2),type="upper",tl.cex=0.7)

There are some negative correlations especially involving dis (distance from Boston employment centres) - one of them is a negative correlation with age. Tax has a strong positive correlation with several variables - e.g. rad (accessibility to radial highways). But at this point it is difficult to say much more about the data, since there do not seem to be any clear patterns.
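
A quick way to pick out the strongest pairs from the matrix (a small sketch using the cor_mtx object created above; the 0.7 cut-off is an arbitrary choice):

# list variable pairs whose absolute correlation exceeds 0.7
strong <- which(abs(cor_mtx) > 0.7 & upper.tri(cor_mtx), arr.ind=TRUE)
data.frame(var1=rownames(cor_mtx)[strong[,1]], var2=colnames(cor_mtx)[strong[,2]], cor=cor_mtx[strong])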

4

Since the scales of the variables were so different from each other (and that is bad for distance-based clustering), we need to scale the data:

boston_sc <- scale(Boston)
boston_sc <- as.data.frame(boston_sc) # scale -function outputs a matrix and we have to transform it into a data frame
summary(boston_sc)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

Scaling changed the data so that the mean of every variable is 0. It also made the variables resemble each other more - for example tax ranged from 187.0 to 711.0, and after scaling it ranges from -1.3127 to 1.7964. This step makes sense because we are interested in the distances between the cases described in this dataset.
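
A quick check that scale() did the usual thing, i.e. subtracted the mean and divided by the standard deviation (shown here for tax):

# manual standardisation of tax should match the scaled column
head((Boston$tax - mean(Boston$tax)) / sd(Boston$tax))
head(boston_sc$tax)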

Then we will create a categorical variable called crime, which is based on the quantiles of the crim variable:

brks <- quantile(boston_sc$crim)
lbls <- c("low","med_low","med_high","high")
crime <- cut(boston_sc$crim, breaks=brks, labels=lbls, include.lowest=TRUE)
boston_sc$crime <- crime
boston_sc <- boston_sc[,-1] # remove the old crim variable
summary(boston_sc$crime)
##      low  med_low med_high     high 
##      127      126      126      127

Then let’s divide the data into training (80%) and testing sets (20%):

n <- nrow(boston_sc)
ind <- sample(n, size=n*0.8)
train_set <- boston_sc[ind,]
test_set <- boston_sc[-ind,]

Now we have two sets. The train_set contains a randomly chosen 80% of the cases of the Boston data, and the test_set contains the remaining 20%.

5

First we will do a linear discriminant analysis on the training set (train_set). The categorical crime variable is the target variable and the rest of the variables are used as predictors:

boston_lda_train <- lda(crime~., data=train_set)
boston_lda_train
## Call:
## lda(crime ~ ., data = train_set)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2549505 0.2549505 0.2425743 0.2475248 
## 
## Group means:
##                   zn      indus        chas        nox          rm
## low       0.93707168 -0.8950517 -0.08120770 -0.8763670  0.41384327
## med_low  -0.08760862 -0.3259310 -0.11943197 -0.5868836 -0.13222881
## med_high -0.39886078  0.3100568  0.12941585  0.4345614  0.07837795
## high     -0.48724019  1.0171519 -0.03610305  1.0400186 -0.41395857
##                 age        dis        rad        tax     ptratio
## low      -0.9195960  0.8827680 -0.6819317 -0.7495354 -0.48935046
## med_low  -0.3811273  0.3466925 -0.5481298 -0.4906543 -0.05166147
## med_high  0.4431831 -0.4180251 -0.3924029 -0.2482491 -0.26980126
## high      0.8087396 -0.8524956  1.6377820  1.5138081  0.78037363
##               black       lstat         medv
## low       0.3854803 -0.78098247  0.535965403
## med_low   0.3200155 -0.17370553  0.005616952
## med_high  0.1126910 -0.00931691  0.171399316
## high     -0.7907896  0.84894056 -0.684868439
## 
## Coefficients of linear discriminants:
##                 LD1         LD2          LD3
## zn       0.09970677  0.55560692 -0.919098027
## indus    0.08007020 -0.32367738  0.101386152
## chas    -0.05623760 -0.03025670  0.061209166
## nox      0.35139218 -0.80067765 -1.310355717
## rm      -0.08245175 -0.07998685 -0.145962683
## age      0.30043988 -0.40598827 -0.083843163
## dis     -0.05699972 -0.27602604 -0.005234958
## rad      2.98852557  1.06284963 -0.211653580
## tax     -0.03092167 -0.11791123  0.559700275
## ptratio  0.09249473 -0.01631742 -0.157751394
## black   -0.12078987 -0.02579590  0.084102690
## lstat    0.19325486 -0.06740725  0.421736142
## medv     0.15624251 -0.35591055 -0.214918759
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9424 0.0439 0.0137

Then a (bi)plot of the LDA (with colors based on the crime rate - for some reason pch=crime_as_n did not work in this plot):

crime_as_n <- as.numeric(train_set$crime)
plot(boston_lda_train,dimen=2,col=crime_as_n)

The end result looks a bit different from the plot in the Data Camp exercise (I do not know if this is a bad thing, and if it is, how bad).

6

Then we will test how well the model we created works on data it has not seen before (test_set). The first thing to do is to remove the “right answers” from the test_set (crime is column number 14):

correct_from_test <- test_set[,14]
test_set <- test_set[,-14]

Then we will use the lda model to classify the cases (their crime rates) from the test_set and illustrate the results with a cross-table:

boston_pred <- predict(boston_lda_train,newdata=test_set)
addmargins(table(correct=correct_from_test,predicted=boston_pred$class))
##           predicted
## correct    low med_low med_high high Sum
##   low       15       9        0    0  24
##   med_low    4      13        6    0  23
##   med_high   1      15       12    0  28
##   high       0       0        0   27  27
##   Sum       20      37       18   27 102

It seems that this model was quite good at classifying the cases from the test_set (67/102 were classified right). Although, in the cases of med_low and med_high, it only managed to get 13/23 and 12/28 right. Overall the model placed a case in the right “basket of crime rate” in about 66% of the cases, which I guess can be considered a relatively good success rate (random guessing over four classes would get only about 25% right).
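
The overall accuracy can also be computed directly from the cross-table (a small sketch reusing the objects created above):

conf_tbl <- table(correct=correct_from_test, predicted=boston_pred$class)
sum(diag(conf_tbl)) / sum(conf_tbl)  # proportion of test cases classified correctly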

7

Re-loading and standardizing the Boston dataset:

data(Boston)
boston_again <- scale(Boston)
boston_again <- as.data.frame(boston_again)

First we will calculate the distances between the observations. The very basic method, euclidean, is used in this calculation:

boston_euc_dis <- dist(boston_again, method="euclidean")
summary(boston_euc_dis)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970

For evaluating the best number of clusters I will use the approach that was introduced in the Data Camp exercise (the maximum number of clusters in this evaluation is 10):

set.seed(22)
cluster_n <- 10
wcss <- sapply(1:cluster_n, function(k){kmeans(boston_again,k)$tot.withinss})
qplot(x=1:cluster_n,y=wcss,geom="line")

I think that the best number of clusters is 3, since the total within-cluster sum of squares (wcss) does not decrease much after that point. Another possible number could be 5, but when using it in a visualization the result is quite messy. Hence, I will go with the classic approach of “less is more”. Next is the visualization of the clusters when the number of clusters is 3:

boston_km <- kmeans(boston_again,centers=3)
pairs(boston_again,col=boston_km$cluster,cex=0.2)

It seems that the “black cluster” jumps out in each plot (of course this could partly be because its color is the most visible in the small graphs). Three variables seem to show (in most cases) clear clusters: black, crim and lstat. My interpretation is that these variables have an important role in the clustering - i.e. the clusters are heavily influenced by them. It also seems that the cluster drawn in black has “a strong identity”, since it is clearly visible in most of the plots and hence clearly different from the other clusters.
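
One way to check this is to look at the cluster centres (in standardised units) for those variables (a small sketch using the boston_km object created above):

round(boston_km$centers[, c("crim", "black", "lstat")], 2)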

Bonus

I will perform the k-means with four clusters (the arrow function I use is the same one that was in the Data Camp exercise):
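
The lda.arrows helper is not defined in this document; a minimal sketch of such a helper, with argument names assumed to match the Data Camp version, is:

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1, 2)) {
  # draw the linear discriminant coefficients of an lda fit as arrows on an existing LDA plot
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, x1 = myscale * heads[, choices[1]], y1 = myscale * heads[, choices[2]],
         col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads), cex = tex, col = color, pos = 3)
}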

data(Boston)
set.seed(22)
boston_again <- scale(Boston)
boston_again <- as.data.frame(boston_again)
boston_km <- kmeans(boston_again,centers=4)
boston_again$cluster <- boston_km$cluster
boston_lda <- lda(cluster~., data=boston_again)
plot(boston_lda,dimen=2,col=boston_again$cluster)
lda.arrows(boston_lda, myscale = 2)

It seems that the most influential individual variable is black. Other variables that have a relatively strong influence are crim, indus, nox and tax. The variables black and crim seem to “pull” in their “own ways”, while most of the variables belong to a group that “pulls” to the left. The fact that black seems to be influential confirms earlier observations (e.g. part 7 of this exercise).

Super-bonus:

model_predictors <- dplyr::select(train_set, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(boston_lda_train$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% boston_lda_train$scaling
matrix_product <- as.data.frame(matrix_product)

# install.packages("plotly")
library(plotly)

plot_ly(x=matrix_product$LD1,  y=matrix_product$LD2, z=matrix_product$LD3, type="scatter3d", mode="markers", color=crime_as_n,size=I(40))

# cluster_col was not defined above; one way to create it is to run k-means on the training-set predictors
cluster_col <- kmeans(model_predictors, centers = 4)$cluster
plot_ly(x=matrix_product$LD1,  y=matrix_product$LD2, z=matrix_product$LD3, type="scatter3d", mode="markers", color=cluster_col,size=I(40))

The way the colors are scattered in these two plots resembles each other relatively well. Based on them, I would say that the crime variable seems to play a relatively large role when the algorithm defines the clusters, since the point colors of the clusters are scattered quite similarly to the crime rate.